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A review of post-GWAS prioritization approaches 



In the recent decade, high-throughput genotyping and next-generation sequencing plat- 
forms have enabled genome-wide association studies (GWAS) of many complex human 
diseases. These studies have discovered many disease susceptible loci, and unveiled 
unexpected disease mechanisms. Despite these successes, these identified variants only 
explain a small proportion of the genetic contributions to these diseases and many more 
remain to be found. This is largely due to the small effect sizes of most disease-associated 
variants and limited sample size. As a result, it is critical to leverage other information to 
more effectively prioritize GWAS signals to increase replication rates and better understand 
disease mechanisms. In this review, we introduce the biological/genomic features that have 
been found to be informative for post-GWAS prioritization, and discuss available tools to 
utilize these features for prioritization 
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INTRODUCTION 

With the developments of affordable and reliable high-throughput 
genotyping and next-generation sequencing platforms, many 
genome-wide association studies (GWAS) have been successfully 
conducted to identify DNA variants associated with many com- 
plex human diseases and traits, such as cancer, autoimmune 
diseases, height, blood pressure, body mass index, among others. 
As of 11/16/13, there were 11,907 single nucleotide polymor- 
phisms (SNPs), 940 traits with 15,052 associations documented in 
the GWAS catalog, maintained by the National Human Genome 
Research Institute (Hindorff etal., 2009). These studies have 
uncovered many novel genes and implicated unexpected pathways 
associated with disease mechanisms, leading to great insights on 
disease etiology. 

In spite of these accomplishments, many challenges remain in 
GWAS design and analysis. The first limitation is the limited statis- 
tical power to identifj' all disease-associated loci. Although many 
susceptible loci have been identified, they only explain a small 
fraction of the overall heritabUity, with the majority of heritability 
remaining unexplained. One possible reason is that the missing 
heritability is due to the lack of coverage of genetic variations 
on the genotyping platforms, such as those rare or even private 
variations. Another explanation is that most disease-associated 
variants have small effect sizes that are not likely detected due 
to low statistical power, even with thousands of subjects. To 
better identify these variants, more powerful and cost-effective 
designs and statistical methods are desired. Several approaches 
have proved cost-effective to enrich signals and increase statis- 
tical power. For example, a number of customized genotyping 
platforms have been used to target certain genomic regions with 
high density to fine map disease-associated variants. The success- 
fiil examples include the use of the ImmunoChip (Trynka etal, 
2011)to fine map 186 distinctloci associated with 12 autoimmune 
diseases, and the use of the MetaboChip (Voight etal, 2012) to 
fine map established trait-associated loci. As for rare variant anal- 
ysis, a number of whole exome sequencing studies have enjoyed 



success for diseases like autism and schizophrenia (Xu etal., 2011; 
Sanders etal., 2012). It has also been found that studies focusing 
individuals with extreme phenotypes can increase statistical power 
because of the enriched genetic signals in the study subjects (Lin 
et al., 2013). In addition to improved platforms and study designs, 
many statistical methods have also been developed to increase sta- 
tistical power. Meta-analysis is commonly applied to leverage all 
the information from separate studies to increase statistical power 
to identify disease-associated loci. A number of methods aim to 
investigate the combinatorial effects of a group of SNPs, including 
both marginal and interaction effects. These are accomplished by 
explicitly modeling the interactions of two or more SNPs; joint 
analysis of all the SNPs in a gene or defined region; and joint con- 
sideration of the SNPs in proximity of all the genes annotated in 
a specific pathway/network. The advances in study designs and 
statistical methods have led to many novel discoveries. For exam- 
ple, a recent study of the inflammatory bowel disease (IBD; jostins 
etal., 2012), which is a meta-analysis of both ImmunoChip and 
GWAS datasets, increased the number of IBD loci to 163, where 
these loci account for 13.6% of the genetic risk of Crohn's disease 
(CD) and 7.5% of ulcerative colitis. See Cantor etal. (2010) for a 
comprehensive review of some of the topics. 

The second limitation is the difficulty to interpret the biolog- 
ical relevance of susceptible loci and link them with the disease 
etiology. Because the ultimate goal of association studies is to 
understand disease etiology and develop effective strategies to pre- 
vent and treat diseases, it would be desirable that GWAS results 
can implicate functional variants and disease pathways that can 
be experimentally studied in follow-up functional studies. How- 
ever, a large proportion of disease-associated variants fall into 
non-coding regions of the genome, with 88% of the associated 
SNPs in GWAS catalog non-coding (Hindorff et al, 2009), making 
it difficult to form testable hypothesis. Even when the variants in 
the coding region, it is often not clear whether they are functional 
due to the presence of several closely linked variants. To address 
this problem, many statistical methods have been proposed to 
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prioritize GWAS signals by incorporating diverse functional evi- 
dence, so that variants with small effect sizes but possessing 
functional features may be prioritized over variants with simi- 
lar effect sizes but less likely to be functional. GWAS signals can 
be prioritized at both the SNP level and gene level, depending 
on the biological features considered and the input signals. Sta- 
tistical approaches that prioritize at the SNP level are especially 
helpful in pinpointing the causal variants with sequence data, 
where essentially all the variations in the genome can be iden- 
tified. This is contrast to earlier GWAS that only interrogated 
a subset of SNPs, such as tag SNPs, in genotyping platforms. 
One benefit of such approaches is that the functional evidence 
provides paths to derive plausible and testable hypotheses for 
the prioritized genes or loci. Moreover, with the incorporation 
of other data informative about disease association, the pri- 
oritized genes/loci are more likely to be truly associated with 
disease. For example, it has been observed that trait-associated 
loci are more concentrated in regions with certain genomic fea- 
tures, such as protein coding regions and expression quantitative 
trait loci (eQTL). In this review, we will review (1) biologi- 
cal/genomic features that are informative for prioritizing GWAS 
results; and (2) computational methods and tools that prioritize 
disease-associated SNPs by integrating these biological/genomic 
features. 

BIOLOGICAL FEATURES USED IN PRIORITIZATION AND 
THEIR JUSTIFICATIONS 

The first step, and sometimes the only step of many SNP prioritiza- 
tion approaches is to annotate the candidate SNPs by intersecting 
GWAS signals with desired genomic features, such as eQTLs, tran- 
scription factor binding sites, DNase hypersensitive sites, histone 
modifications, and others. For CD, Fransen etal. (2010) showed 
that ds-eQTL SNPs were enriched in known CD-associated SNPs. 
Based on this observation, the authors proposed to select a sub- 
set of SNPs to follow-up a public CD GWAS dataset, to intersect 
the top 500 GWAS hits with czs-eQTLs in an eQTL database. The 
SNPs thus selected, ds-eQTLs of the genes UBE2L3 and BCL3, were 
replicable in two independent replication cohorts. This represents 
a successful application of the annotation-based prioritization 
methods. 

The genomic features may implicate functional roles of the 
prioritized SNPs to disease etiology, and these hypotheses can be 
formally tested through molecular studies. These include variants 
both in coding and non-coding regions. Through these filter- 
ing/intersecting methods, researchers can focus on a much smaller 
number of SNPs in follow-up studies. Although the proximity of a 
SNP with a documented genomic feature may suggest a functional 
role of the SNP, it may not necessarily increase the probability 
that it will affect the phenotype of interest, nor the probability 
that the locus is truly susceptible. In general, the genomic fea- 
tures discussed above are considered biologically plausible and 
extensively used to prioritize SNPs, but whether each feature is 
informative on a SNP's functional relevance is disease and context 
dependent. In a recent study, Minelli et al. (2013) tried to identify 
features that are important in selecting SNPs for follow-up stud- 
ies by surveys in experts. They sent questionnaires to ten experts 
who conducted GWAS studies, and asked their opinion on the 



importance of a set of selected features. The features included 
relative position of the SNP to a nearby transcript, whether the 
SNP causes an amino acid change, etc. (see Table 2 in their 
paper). The result was not surprising, as experts considered gene 
level evidence more important, such as the SNP in a gene that 
is previously associated with the phenotype, or that encodes a 
protein in a phenotype related pathway, or that has gene/protein 
interaction relevant to the phenotype. Experts opinions are valu- 
able, however, they might be biased toward existing knowledge 
and also expertise in specific diseases. Nevertheless, this paper 
highlighted the need for understanding what features should be 
considered in prioritization. In the following, we review these 
features and statistical methods to use these features, to inform 
human geneticists in their applications of the annotation-based 
approaches. 

EXPRESSION QUANTITATIVE TRAIT LOCI 

By contrasting the SNPs documented in the GWAS catalog 
(Hindorff etal., 2009) with those randomly sampled SNPs with 
matching minor allele frequency distribution, Nicolae et al. (2010) 
showed that complex trait-associated SNPs are more likely to be 
eQTLs. The conclusion remained valid for a linkage disequilib- 
rium (LD) -pruned subset of SNPs in the GWAS catalog. Since the 
eQTL annotation considered by Nicolae et al. (2010) was derived 
from an expression dataset of lymphoblastoid cell lines, it was 
of interest to investigate whether cell line-specific eQTLs showed 
different levels of enrichments across diseases with different focal 
tissues, including cancer, neurological/psychiatric disorders, and 
autoimmune disorders. By tissue of relevance, the lymphoblastoid 
cell lines should be a good proxy for autoimmune disorders, and 
relatively poor for cancer and neurological/psychiatric disorders. 
As expected, there was greater enrichment of eQTLs in the group 
of autoimmune disorders, while only moderate enrichment for 
the other two groups of diseases. Furthermore, in the examination 
of the results in the Wellcome Trust Case Control Consortium 
(WTCCC) GWAS dataset of CD, eQTLs were enriched in SNPs 
with association p-value less than 0.01, but not for the missense 
SNPs, indicating potential loss of information if non-coding SNPs 
are ignored. 

PROTEIN DELETERIOUSNESS PREDICTIONS 

Polymorphisms in the coding region may have different effects 
on protein function. Synonymous SNPs do not change the corre- 
sponding protein sequence; non-synonymous SNPs change amino 
acid composition, or truncate the protein sequence by causing 
an early codon; indels can change protein sequence with vary- 
ing consequence depending on whether the indel is in-frame or 
frame-shifting; SNPs and indels can also interrupt splicing sites, 
thus change the mRNA isoform. In other words, mutations in 
the coding region may be benign or deleterious to protein func- 
tion, with deleterious mutations more likely to have a phenotypic 
effect. Many computational tools have been developed to pre- 
dict "deleteriousness" of SNPs and indels (Ng and Henikoff, 2003; 
Adzhubei et al, 2010). These methods generally take features like 
biochemical property of the altered amino acid, conservation and 
sequence homology as input, and use machine learning technique 
to train a classifier. These methods have been comprehensively 
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reviewed by Cooper and Shendure (2011) and Ng and Henikolf 
(2006). 

The most extreme case of protein function interruption is the 
loss of function mutations. However, genome sequencing studies 
found that all human carry loss of function mutations without 
obvious phenotypic effect, and common loss of function variants 
were depleted in polymorphisms associated with complex disease 
like CD and rheumatoid arthritis (MacArthur etal, 2012). The 
results indicate that the "deleteriousness" feature should be inter- 
preted with caution, since disruption of protein function does not 
necessarily have a phenotypic effect. In this regard, the "resid- 
ual variance intolerance score" has been defined quantitatively 
measure the tolerance of a protein to mutations (Petrovski etal, 
2013). The number of missense and non-sense variants found in 
each gene in the cohort of the National Heart, Lung, and Blood 
Institute (NHLBI) exome sequencing project was compared to the 
number of functionally neutral variants, and genes with variants 
fewer than expected are assigned a negative score, indicating they 
are less tolerant to variations. 

DIFFERENTIAL GENE EXPRESSIONS 

Gene expression microarrays and RNA-seq are commonly used to 
study gene expression profiles in disease cases and matched con- 
trols, and differentially expressed genes thus identified may suggest 
disease mechanisms and potential biomarkers that can be further 
explored in follow-up studies. Chen etal. (2008) analyzed 476 
expression datasets in the Gene Expression Omnibus (GEO), and 
calculated the frequency that a gene was differentially expressed 
in these datasets, which they called "differential expression ratio." 
They found that differential expression ratio is positively corre- 
lated with the likelihood that a gene harbors disease-associated 
variants, where the list of disease-associated genes was created by 
combining information from Genetic Association Database (GAD; 
Becker et al., 2004) and Human Gene Mutation Database (HGMD; 
Stenson et al., 2003). In addition, they found that among the genes 
discovered in the initial scan of the WTCCC type 1 diabetes mel- 
litus GWAS dataset, the differential expression ratio was higher in 
genes that were replicable than those not replicable in follow-up 
studies. These authors have developed an online server, FitSNPs, 
to incorporate this feature (see Table 1). 

DNase I HYPERSENSITIVE SITES 

DNase I hypersensitive sites (DHSs) are markers of accessible 
chromatin, which indicate regulatory roles in the transcription 
process. DHS have been mapped in 349 cell and tissue sam- 
ples genome-wide by next-generation sequencing (Thurman et al, 
2012). Enrichment analysis showed that trait-associated SNPs in 



Table 1 | A list of online SNP prioritization tools. 



Name Website Reference 

FASTSNP http://fastsnp.ibms.sinica.edu.tw Yuan etal. (2006) 

FitSNPs http://fitsnps.stanford.edu/fitSNPs.php Chen etal. (2008) 

SNPranker2.0 http://www.itb.cnr.it/snpranker Merelli etal. (2013) 

SPOT http://spot.cgsmd.isi.edu Saccone etal. (2010) 



the GWAS catalog (Hindorff etal., 2009) are more concentrated 
within DHS regions, excluding confounding factors such as allele 
frequency and distance from the nearest transcriptional start site 
(Maurano etal, 2012). 

OTHERS 

There are many more genomic features collected and annotated 
in large community projects, such as the Encyclopedia of DNA 
Elements (ENCODE; Consortium, 2011), which are potentially 
valuable for SNP prioritization. Kindt etal. (2013) examined 
enrichment or depletion of trait-associated SNPs in 58 genomic 
features. The features investigated covered genie and regulatory 
features, conservation features, and chromatin state features (see 
Table 1 in Kindt etal, 2013). Among those features, genomic 
regions annotated as "heterochromatin" and "low expression sig- 
nals" are depleted of trait-associated SNPs, while eQTLs and 
"strong enhancer" showed the highest level of enrichment. 

The biological features discussed so far are measured/inferred 
from laboratory cell lines and the sequence and annotation of 
the human genome, which do not provide trait-specific infor- 
mation. However, trait-relevant features are intrinsically helpful 
for prioritization. For example, a DNase-seq experiment in intes- 
tine tissues and immune cells of CD patients would be more 
informative for prioritizing variants associated with CD than 
those measured in brain tissues. Maunakea etal. (2010) and 
Portela and Esteller (2010) reviewed recent progress in mapping 
the epigenome (including DNA methylation and histone modi- 
fication), showing that epigenetic modifications play important 
roles in human diseases, including cancer, neurodevelopmen- 
tal disorder, neurodegenerative disease, neurological disease, and 
autoimmune diseases. Thus, epigenome data in disease states is 
valuable for understanding disease and prioritize disease suscepti- 
ble loci. However, efforts in disease-specific epigenome mapping 
and systematic database to document such data are stiU lack- 
ing and greatly needed. For DNA methylation alone, a database, 
DiseaseMeth, has incorporated methylation data for 72 human 
diseases (Lv etal, 2012). 

SNP PRIORITIZATION APPROACHES AND AVAILABLE WEB 
SERVERS 

Here we review methods and tools that prioritize GWAS signals 
at single SNP level. There are mainly two steps in these meth- 
ods. The first step applies annotations or filters based on whether 
or not the candidate SNP has the desirable features and the sec- 
ond step scores the candidate SNPs by integrating evidence from 
multiple sources. Saccone etal. (2008, 2010) developed an online 
prioritization tool, SPOT, which systematically combines multiple 
biological databases to prioritize SNPs by the genomic informa- 
tion network (GIN) model (see Table 1). In their model, each 
SNP is assigned a prioritization score, which is a linear sum of 
scores derived from pathway information, comparative genomics, 
linkage scan, and results of other independent GWAS studies. 
The weights are decided by the strength of the link between the 
SNP and the annotations. For example, for a SNP that is in LD 
with a non-synonymous SNP of a susceptible gene, the assigned 
weight will be less than that of SNPs physically in the gene. The 
methodology prioritized SNPs with increased biological relevance 



www.frontlersin.org 



December 2013 | Volume 4 | Article 280 | 3 



Hou and Zhao 



Post-GWAS prioritization 



in a GWAS study of nicotine dependence. Thompson et al. (2013) 
incorporated biological features in a Bayesian framework, where 
the prior probability that a SNP is associated with the phenotype 
is determined by its annotations. They first curated a training set, 
including SNPs that were confirmed replicable as the positive set, 
and 1,000 randomly selected SNPs as control set. For a selected 
set of features, a logistic regression model was fit for each disease. 
Thus, the log odds ratio that a test SNP is associated with the 
disease can be estimated through the model. 

There are also web servers that perform SNP prioritization in 
an annotation fashion. They annotate the candidate SNPs by single 
or multiple features, but do not combine the results. They differ by 
the features and strategies they use in prioritization. A list of SNP 
prioritization resources are provided in Table 1 . Most of these tools 
are only applicable to SNPs, and tools that can prioritize indels are 
still lacking. 

FASTSNP uses a decision tree framework to assign different risk 
level to SNPs by considering the genomic location and functional 
effect of the SNPs (Yuan et al, 2006). 

FitSNPs calculated a differential expression ratio for all genes 
in the genome, and prioritize SNPs by the differential expression 
ratio of their associated genes (Chen et al., 2008). 

SNPranker 2.0 first annotates the SNPs with different features, 
and then user a user interactive way to integrate features (Chen 
etal., 2008). Users can specify the features they want to include 
and the weight of each feature, which would give the users an 
opportunity to enforce their biological priors. But they also pro- 
vide an optimal set of weights by default. The optimal weights 
were determined by a cross-validation approach. 

TOOLS FOR VARIANT ANNOTATION 

Besides the SNP prioritization tools, there are also many web 
servers and software for variant annotation (Table 2), which could 
provide useful information for prioritization. Basically, these tools 
take a list of query variants as input, and annotate them with their 
in-house databases. Among these, HaploReg (Ward and Kellis, 
2012a) and RegulomeDB (Boyle etal., 2012) provide annotation 
for variations in non-coding regions. HaploReg annotates vari- 
ations by their chromatin state, conservation across mammals, 
and computationally predicted transcription factor binding sites. 
Besides, by utilizing LD information from the 1000 Genomes 
Project, HaploReg automatically reports, and annotates all vari- 
ations within a user-specified LD threshold of the query variant. 
RegulomeDB has incorporated many data sources, including the 



ENCODE project, available transcription factor ChlP-seq data, 
and eQTL datasets. The other tools are designed for variations in 
the whole genome. They annotate the query variations by dbSNP 
ID, allele frequency in different ethnic groups, position in a tran- 
script (intron, exon, 5' UTR, etc.,), and the resultant amino acid 
change if any. SeattleSeq (Ng etal., 2009) and Variant Effect Pre- 
dictor (VEP; McLaren etal., 2010) has convenient web interface, 
suitable for users who are not familiar with scripts and program- 
ming languages, while ANNOVAR (Wang et al, 2010) and Snpeff 
(Cingolani etal., 2012) are stand-alone software packages, so that 
they can be easily incorporated into variant analysis pipelines. Dis- 
cussions on variant annotation tools can also be found in Ward 
and KeUis (2012b). 

DISCUSSION 

In this review, we have focused on biological and genomic fea- 
tures that are informative for SNP prioritization. In the second 
phase of association studies, researchers can use these databases or 
tools to choose SNPs in foUow-up studies. The observation that 
eQTLs and open chromatin regions are enriched of trait-associated 
SNPs highlights the potential rich information in the non-coding 
regions of the genome (Nicolae etal., 2010). Nonetheless, the 
gene-centric approaches may be helpful in disease gene discov- 
ery, and there are many approaches that perform prioritization at 
the gene level. A review of the methods and tools for gene-based 
prioritization can be found in Tranchevent et al. (201 1). 

The prioritization methods discussed here take as input a list 
of candidate SNPs, which is usually derived by taking all the SNPs 
achieving a specified significance level in GWAS, e.g., all SNP 
with p-values less than 0.01. The SNPs are treated equally regard- 
less of the association p-values. However, the p-value, the effect 
size, and other statistics that summarize the association level of 
individual SNP could be informative for SNP selection. A com- 
putational framework that incorporates the significance level with 
the biological/genomic features discussed above might improve 
the performance of the prioritization scheme. A discussion of dif- 
ferent signal measures of association was given by Stromberg et al. 
(2009). 

Although there are many web servers and databases for SNP 
prioritization, most of them provide annotations of different types 
of features, but do not rank these SNPs through integrating GWAS 
and annotation information. Also, these methods do not employ 
disease-specific information. There is still a great need for sta- 
tistical methods that select and integrate multiple annotations in 



Table 2 | A list of tools for variant annotation. 



Name 



Website 



Reference 



ANNOVAR 

HaploReg 

RegulomeDB 

SeattleSeq 

Snpeff 

VEP 



http://www.openbioinformatics.org/annovar/ 

http://www.broadinstitute.org/mammals/haploreg/haploreg.php 

http://regulome.stanford.edu/ 

http://snp.gs.washington.edu/SeattleSeqAnnotation137/ 
http://snpeff.sourceforge.net 

http://useast.ensembl.org/info/docs/variation/vep/index.html 



Wang etal. (2010) 
Ward and Kellis (2012a) 
Boyle etal. (2012) 
Ng etal. (2009) 
Cingolani etal. (2012) 
McLaren etal. (2010) 
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a disease-specific manner, and re-rank SNPs under a coherent 
statistical fi-amework. 
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